In this notebook I'll show how I used a pre-trained model to classify audio samples into 10 different categories.
Data from https://urbansounddataset.weebly.com/urbansound8k.html
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import librosa
import librosa.display
from IPython.display import Image
from IPython.display import display
from IPython.display import Audio
First, let's check how many samples we have per label (are they balanced?)
base_dir = r'C:\Users\USER1\Desktop\urban_sound\UrbanSound8K.tar\UrbanSound8K'
metadata_file = os.path.join(base_dir, r'metadata\UrbanSound8K.csv')
metadata = pd.read_csv(metadata_file)
label_counts = metadata['class'].value_counts()
plt.figure(figsize=(12, 6))
sns.set_context("notebook", font_scale=1.2)
sns.barplot(x=label_counts.index, y=label_counts.values, alpha=0.9)
plt.xticks(rotation='vertical')
plt.xlabel('Class Labels', fontsize=16, labelpad=20)
plt.ylabel('Counts', fontsize=16)
plt.title('Label counts', fontsize=20)
plt.tight_layout()
We see that the labels "car horn" and "gun shot" have far fewer samples than the rest. Later we'll check whether this is a problem.
Now, let's have a look at the Audio files.
wav_name = []
fold = []
wav_list = []
sr_list = []
audio_list = []
labels = metadata['class'].unique()
for label in labels:
    my_rows = metadata.loc[:, 'class'] == label
    wav_name.append(list(metadata.loc[my_rows, 'slice_file_name'][0:3]))
    fold.append(list(metadata.loc[my_rows, 'fold'][0:3]))
wav_name = np.array(wav_name).flatten()
fold = np.array(fold).flatten()
for i in range(len(wav_name)):
    wav_file = os.path.join(base_dir, 'audio', 'fold' + str(fold[i]), wav_name[i])
    y, sr = librosa.load(wav_file)
    audio_list.append(y)
    sr_list.append(sr)
    wav_list.append(wav_file)
fig, axes = plt.subplots(nrows=10, ncols=3, figsize=(15, 25))
fig.subplots_adjust(hspace=1.5, top=0.95)
fig.suptitle('Audio waveforms')
my_label = 0
for i, (y, sr) in enumerate(zip(audio_list, sr_list), start=1):
    plt.subplot(10, 3, i)
    librosa.display.waveplot(y, sr=sr)
    plt.xlabel('Time (Sec)')
    if i % 3 == 1:  # first column: label the row with its class name
        plt.ylabel(labels[my_label], fontsize=20)
        my_label += 1
We see how different labels have different waveforms. Some labels have very distinct patterns and are easy to differentiate from the rest (like "siren" or "dog bark"), while for other labels it seems impossible (can you tell whether it's "drilling" or "air conditioner"?). Notice that even samples of the same label are not always similar (for example, the first "dog bark" plot is not like the other two).
When it comes to audio, it's obvious that the amplitude itself is not enough. The frequencies of the sound waves contain a lot of information which we can use.
Using the short-time Fourier transform (STFT), we can analyze the data in the time and frequency domains simultaneously.
fig, axes = plt.subplots(nrows=10, ncols=3, figsize=(20, 45))
fig.subplots_adjust(hspace=1)
fig.suptitle('Spectrograms')
my_label = 0
for i, (y, sr) in enumerate(zip(audio_list, sr_list), start=1):
    plt.subplot(10, 3, i)
    log_S = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(log_S, sr=sr, x_axis='time', y_axis='log')
    locs, y_labels = plt.yticks()
    plt.yticks(locs[::2])  # thin out the frequency ticks
    if i % 3 == 1:  # first column: label the row with its class name
        plt.ylabel(labels[my_label], fontsize=20)
        my_label += 1
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
One commonly used method in audio analysis is to transform the frequencies from Hz to the Mel scale. The Mel scale is based on human-perception experiments, which found that the human ear discriminates better between lower frequencies than between higher ones.
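As a concrete reference, a minimal sketch of one widely used Hz-to-mel formula (the exact constants are a convention; librosa's default actually uses a closely related Slaney variant):

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's formula: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Equal steps in Hz shrink on the mel axis as frequency grows,
# matching the perceptual claim above:
mels = [hz_to_mel(f) for f in (500, 1000, 4000, 8000)]
```

With these constants, 1000 Hz maps to roughly 1000 mels, while the 4 kHz–8 kHz octave spans far fewer mels than the 0–4 kHz range.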
fig, axes = plt.subplots(nrows=10, ncols=3, figsize=(20, 40))
fig.subplots_adjust(hspace=1)
fig.suptitle('Mel-scaled spectrograms')
my_label = 0
for i, (y, sr) in enumerate(zip(audio_list, sr_list), start=1):
    plt.subplot(10, 3, i)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    log_S = librosa.power_to_db(S, ref=np.max)  # melspectrogram returns power, not amplitude
    librosa.display.specshow(log_S, sr=sr, x_axis='time', y_axis='mel')
    if i % 3 == 1:  # first column: label the row with its class name
        plt.ylabel(labels[my_label], fontsize=20)
        my_label += 1
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
The spectrograms are better visual representations of the signal than the waveforms, and the mel-scaled spectrograms are better still. With the mel-scaled spectrograms, it's easier to see the similarity of samples within the same label and the differences between labels.
Now that we have images, we can use a pre-trained neural network specialized in image classification. Here I used Inception, which was already trained to classify the 1,000 classes of the ImageNet dataset.
We will only retrain the weights of the model's last layer, just before the softmax layer that gives a score for each label.
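The retraining itself was run with TensorFlow (results were viewed in TensorBoard). As an illustration of the same last-layer idea, here is a minimal Keras sketch; the function name and hyperparameters are mine, and `weights=None` stands in for `weights='imagenet'` only to avoid the download in this sketch:

```python
import tensorflow as tf

def build_transfer_model(num_classes=10, input_shape=(299, 299, 3)):
    # In a real run, use weights='imagenet' to load the pre-trained weights.
    base = tf.keras.applications.InceptionV3(include_top=False, weights=None,
                                             input_shape=input_shape, pooling='avg')
    base.trainable = False  # freeze every pre-trained layer
    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model

model = build_transfer_model()  # only the final Dense layer will be trained
```

Because the base is frozen, only the final Dense layer's kernel and bias are updated during training.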
I split the dataset as follows:
We will optimize the model on the validation set, and only at the final step will we use the test set to evaluate the model's performance.
We will answer the following questions:
x = Image('raw_vs_mel.PNG') # Visualizing model performance was done using tensorboard
display(x)
Orange = Training - Mel-scale
Purple = Training - Hertz
Cyan = Validation - Mel-scale
Blue = Validation - Hertz
Training with mel-scaled spectrograms yields better results than using raw spectrograms. Interestingly, the mel-scaled spectrograms capture more meaningful information than the raw spectrograms even though the mel scale aims to mimic perceptual aspects of the human auditory system. It seems that what works for humans works for machines as well.
In the next analyses, we'll continue only with the mel-scaled spectrograms.
We see accuracy increasing on both the training set and the validation set up to 30K steps.
Has the model stopped learning? Maybe we need to train longer?
x = Image('100k_acc.PNG')
display(x)
x = Image('100kloss.PNG')
display(x)
Orange = Training
Cyan = Validation
The model has quite good performance considering that most of its training was on completely different images. Still, it doesn't generalize very well: the training loss keeps decreasing while the validation loss plateaus around step 50K. The difference between the two is the well-known "generalization gap", which we want to reduce. Worth mentioning, this is not an overfitting issue, since the validation loss does not increase over time.
In the next validation experiments, I'll run the model for only 30K steps to save running time, but before the final test, I'll train the model for 50K steps.
So, what can we do to generalize better?
There are several methods to enhance the generalization of a model. Some of them (dropout, batch normalization, and regularization) are not relevant in our case, since we are working with a pre-trained model and most of the weights are fixed. Here I applied data augmentation during training to make the model more robust to small changes, and hence perform better on the test set.
Usually, in image classification tasks, data augmentation is applied using rescaling, distortions, rotation, cropping, etc. Those are not appropriate here, since the images actually represent audio signals. Instead, we can use other information hidden in the audio files for data augmentation. In the previous part, the audio files were loaded with default parameters, so the two channels of stereo files were averaged together and then processed into one spectrogram. Now, each channel will be loaded separately, and I created three images from one audio file: mono-left, mono-right, and the averaged signal.
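A minimal sketch of that channel-separation step. In the notebook, the array would come from `librosa.load(path, mono=False)`; here the stereo array is made up so the sketch is self-contained:

```python
import numpy as np

def channel_views(y):
    # y as returned by librosa.load(path, mono=False):
    # shape (2, n) for stereo files, (n,) for mono files.
    if y.ndim == 1:
        return [y]                    # mono file: only one view available
    left, right = y[0], y[1]
    mono = y.mean(axis=0)             # equals librosa's default mono mixdown
    return [left, right, mono]

stereo = np.array([[1.0, 3.0], [3.0, 1.0]])  # illustrative 2-channel signal
views = channel_views(stereo)                # left, right, averaged
```

Each view is then turned into its own mel-scaled spectrogram image, tripling the data for stereo files.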
Overall, we increased our training set by a factor of 2.4 (from 8,732 images to 20,968 images).
We should check whether the augmentation process somehow changed the distribution over labels.
image_dir = r'C:\Users\USER1\Desktop\urban_sound\spectrograms_mel'
aug_counts = []
for label in label_counts.index:
    sub_path = os.path.join(image_dir, label)
    aug_counts.append(len(os.listdir(sub_path)))
plt.figure(figsize=(12, 6))
sns.set_context("notebook", font_scale=1.2)
sns.barplot(x=label_counts.index, y=aug_counts, alpha=0.9)
plt.xticks(rotation='vertical')
plt.xlabel('Class Labels', fontsize=16, labelpad=20)
plt.ylabel('Counts', fontsize=16)
plt.title('Label counts after augmentation', fontsize=20)
plt.tight_layout()
The general distribution is preserved: eight labels have roughly the same counts, while two labels ("car horn" and "gun shot") have far fewer samples.
We are going to compare two models: one trained on the initial amount of data, and one trained on the enlarged set that includes the new images from the channel-separation process.
x = Image('aug_vs_basic.PNG')
display(x)
Orange = Training - original data
Purple = Training - augmented data
Blue = Validation - augmented data
Cyan = Validation - original data
Training with more data led to lower accuracy on the training set compared to the model trained on the original data, but increased validation accuracy by 2% (85% to 87%).
Next, we turn to explore how different values of learning rate and batch size will affect model performance.
Learning rate = [0.005, 0.01, 0.05]
Batch size = [50, 100, 200]
Since there is a known interplay between learning rate and batch size in model performance, I'll test all 9 of their combinations.
I found the best setup to be learning rate = 0.05 and batch size = 200, which gives 91.3% accuracy on the validation set.
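The grid over those combinations can be sketched as follows; `train_and_eval` is a hypothetical placeholder for a full retraining run that returns validation accuracy:

```python
import itertools

learning_rates = [0.005, 0.01, 0.05]
batch_sizes = [50, 100, 200]

def train_and_eval(lr, bs):
    # Placeholder: a real run would retrain the last layer with these
    # hyperparameters and return the validation accuracy.
    return 0.0

# One run per (learning rate, batch size) pair: 3 x 3 = 9 runs
results = {(lr, bs): train_and_eval(lr, bs)
           for lr, bs in itertools.product(learning_rates, batch_sizes)}
best_lr, best_bs = max(results, key=results.get)
```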
Remember that we are working with unbalanced data?
Let's have a look at precision, recall, and their harmonic mean (F1-score) for each label.
pd.read_table('precision recall_validation.txt')
We can see that although some categories have much fewer samples ("gun shot" and "car horn"), the model managed to learn their patterns better than other categories.
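For reference, the F1-score is the harmonic mean of precision and recall; a minimal sketch with illustrative numbers:

```python
def f1_score(precision, recall):
    # Harmonic mean: penalizes a large gap between precision and recall
    return 2 * precision * recall / (precision + recall)

f1_score(0.9, 0.8)  # ~0.847, below the arithmetic mean of 0.85
```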
Now for the final test: how will our model manage on unseen samples?
Where was the model right, and where was it wrong?
x = Image('testconfusuon_mat_normalized.jpg')
y = Image('testconfusuon_mat.jpg')
display(x, y)
We see that "gun shot" has perfect accuracy (100%), with all 32 samples correctly classified. On the other hand, "street music" has the lowest score (82%), with 11 false-positive and 16 false-negative events.
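The normalized matrix shown above is a row-normalized view of the raw counts; a minimal sketch of that normalization (the 3x3 example matrix is made up):

```python
import numpy as np

def normalize_confusion(cm):
    # Divide each row by its sum: the diagonal becomes per-class recall
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)

cm = [[32, 0, 0],       # illustrative counts, rows = true labels
      [2, 90, 8],
      [5, 11, 84]]
norm = normalize_confusion(cm)
```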
Let's inspect some of those misclassified samples by plotting their spectrograms and listening to the audio.
ex1_wav = os.path.join(base_dir, 'audio', 'fold5', '109263-9-0-39.wav')
ex2_wav = os.path.join(base_dir, 'audio', 'fold2', '194841-9-0-48.wav')
ex3_wav = os.path.join(base_dir, 'audio', 'fold7', '105289-8-1-1.wav')
examples_files = [ex1_wav, ex2_wav, ex3_wav]
ex_labels = ['True: street music, Predicted: dog bark',
'True: street music, Predicted: siren',
'True: siren, Predicted: street music']
examples_files
i = 0
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))
fig.subplots_adjust(hspace=1, top=0.95)
fig.suptitle('Misclassified')
for wav_file in examples_files:
    y, sr = librosa.load(wav_file)
    i += 1
    plt.subplot(1, 3, i)  # one row of three examples
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    log_S = librosa.power_to_db(S, ref=np.max)
    librosa.display.specshow(log_S, sr=sr, x_axis='time', y_axis='mel')
    plt.title(ex_labels[i - 1])
ex1_wav = os.path.join(base_dir, 'audio', 'fold5', '109263-9-0-39.wav')  # Originally street music
Audio(ex1_wav)
ex2 = os.path.join(base_dir, 'audio', 'fold2', '194841-9-0-48.wav')  # Originally street music
Audio(ex2)
ex3 = os.path.join(base_dir, 'audio', 'fold7', '105289-8-1-1.wav')  # Originally a siren
Audio(ex3)
From both the spectrograms and the audio, we can understand why the model inferred wrongly on these samples. Examples 1 and 2 are not typical street music.
The third example is a siren, but apparently the siren is too melodious, and the model inferred that it's street music.
Still, misclassifying samples that are not "typical" indicates a lack of generalization.
I would continue with the directions I have already tried and found useful:
1) Data augmentation - manipulate the signal in the audio domain, creating more audio files and then generating images from them. An appropriate manipulation could be to inject random noise into the audio signal.
2) Hyperparameter optimization - I would try adaptive learning-rate methods (e.g., Adam) instead of the classic SGD optimizer.
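The noise-injection idea in (1) could be sketched like this; the `noise_factor` value is an illustrative choice:

```python
import numpy as np

def add_noise(y, noise_factor=0.005, rng=None):
    # Add Gaussian noise scaled relative to the signal's peak amplitude,
    # so quiet and loud recordings are perturbed proportionally
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(y))
    return y + noise_factor * np.max(np.abs(y)) * noise

y = np.sin(np.linspace(0, 10, 1000))          # stand-in for a loaded waveform
y_noisy = add_noise(y, noise_factor=0.01)
```

Each noisy copy would then go through the same mel-spectrogram pipeline to produce additional training images.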